Annotating Sanskrit Corpus: Adapting IL-POSTS
Abstract
In this paper we present an experiment in using the hierarchical Indic Languages POS Tagset (IL-POSTS) (Baskaran et al. 2008a, 2008b), developed by Microsoft Research India (MSRI) for tagging Indian languages, to annotate a Sanskrit corpus. Sanskrit is a morphologically rich language with relatively free word order. The authors have added and removed certain tags according to the requirements of the Sanskrit data. A revision of the IL-POSTS annotation guidelines is also presented. The authors also describe an experiment in training the tagger at MSRI and document the results.
Similar resources
Annotating Uncertainty in Hungarian Webtext
Uncertainty detection has been a popular topic in natural language processing, as reflected in the creation of several corpora for English. Here we show how the annotation guidelines originally developed for English standard texts can be adapted to Hungarian webtext. We annotated a small corpus of Facebook posts for uncertainty phenomena and we illustrate the main characteristics of such te...
Believe Me - We Can Do This! Annotating Persuasive Acts in Blog Text
This paper describes the development of a corpus of blog posts that are annotated for the presence of attempts to persuade and corresponding tactics employed in persuasive messages. We investigate the feasibility of classifying blog posts as persuasive or non-persuasive on the basis of lexical features in the text and the tactics (as provided by human annotators). Annotated tactics provide subs...
a-headers from the Aṣṭādhyāyī in Sanskrit literature from the perspective of corpus linguistics
The paper presents strategies for evaluating the influence of Pāṇini’s Aṣṭādhyāyī on the vocabulary of Sanskrit. Using a corpus linguistic approach, it examines how the Pāṇinian sample words are distributed over post-Pāṇinian Sanskrit, and if we can determine any lexicographic influence of the Aṣṭādhyāyī on later Sanskrit. The primary focus of the paper lies on data exploration, becau...
An Approach for Grammatical Constructs of Sanskrit Language using Morpheme and Parts-of-Speech Tagging by Sanskrit Corpus
For thousands of years, Sanskrit has been the oriental language of India. It is the base for most of the Indian languages. Statistical processing of natural language is based on corpora (singular: corpus). A collection of written and spoken texts, gathered in an organized way on electronic media for the purpose of linguistic research, is known as a language corpus. It pr...
Coarse Semantic Classification of Rare Nouns Using Cross-Lingual Data and Recurrent Neural Networks
The paper presents a method for WordNet supersense tagging of Sanskrit, an ancient Indian language with a corpus grown over four millennia. The proposed method merges lexical information from Sanskrit texts with lexicographic definitions from Sanskrit-English dictionaries, and compares the performance of two machine learning methods for this task. Evaluation concentrates on Vedic, the oldest lay...